Summer 2025, Pre-Assignment
Part 3: Introduction to Data Visualization
# INSTALLATION CODE:
# This code block installs a few extra packages you will need
# This may take a few minutes; the icon on the left will spin.
# When it stops spinning, it is complete.
# When we meet in person, we are going to learn other ways to load your
# packages.
pkgs <- c("tidyverse", "ggrepel", "gapminder", "maps", "ggthemes")
to_install <- which(!(pkgs %in% rownames(installed.packages())))
install.packages(pkgs[to_install])
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.0 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggrepel)
library(gapminder)
library(maps)
##
## Attaching package: 'maps'
##
## The following object is masked from 'package:purrr':
##
## map
# Datasets like iris, mpg, gapminder, etc. are all available for you to use here.
# Here, we recreate the Asia dataset in the tutorial for you.
# Filter down to relevant countries
asia <- gapminder |>
filter(
country %in% c("China", "Japan", "Korea, Rep.", "Korea, Dem. Rep."))
# Rename the two Koreas; keep China and Japan as they are
asia <- asia |>
mutate(
country = case_when(
country == "Korea, Rep." ~ "South Korea",
country == "Korea, Dem. Rep." ~ "North Korea",
country == "China" ~ "China",
country == "Japan" ~ "Japan")
)
cat("Done!")
## Done!
# If needed:
# TAKE NOTES HERE for the Primers
As usual, once you have completed the RStudio tutorials above, please start here and continue the document below. These exercises give you additional opportunities to practice the most important concepts from the RStudio Tutorials for our HKS courses.
Looking at a dataset is nice, but we will often want to visualize our data. R has incredibly powerful tools for data visualization.
To start, load the tidyverse library and read this
dataset on US presidential election results from 1932 to 2016 by
state.
# Run this code to load the libraries and do some setup
library(tidyverse)
options(repr.plot.width=10, repr.plot.height=10)
# Only the last theme_set() call takes effect, so we set the theme once
theme_set(theme_linedraw(base_size = 30))
update_geom_defaults("point",list(size=5))
update_geom_defaults("line",list(lwd=1.5))
elections <- read_csv("https://www.dropbox.com/s/lhp9nets5qb2rhe/presidential_elections.csv?dl=1")
## Rows: 1097 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): state, abb, region
## dbl (2): democrat, year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# show first 10 rows
head(elections, n = 10)
## # A tibble: 10 × 5
## state abb democrat year region
## <chr> <chr> <dbl> <dbl> <chr>
## 1 Alabama AL 84.8 1932 South
## 2 Arizona AZ 67.0 1932 West
## 3 Arkansas AR 86.3 1932 South
## 4 California CA 58.4 1932 West
## 5 Colorado CO 54.8 1932 West
## 6 Connecticut CT 47.4 1932 Northeast
## 7 Delaware DE 48.1 1932 South
## 8 Florida FL 74.5 1932 South
## 9 Georgia GA 91.6 1932 South
## 10 Idaho ID 58.7 1932 West
Let’s look at the results for Massachusetts:
# Don't forget to run the code above first to read in the data
ma <- elections |>
filter(state == "Massachusetts")
head(ma)
## # A tibble: 6 × 5
## state abb democrat year region
## <chr> <chr> <dbl> <dbl> <chr>
## 1 Massachusetts MA 50.6 1932 Northeast
## 2 Massachusetts MA 51.2 1936 Northeast
## 3 Massachusetts MA 53.1 1940 Northeast
## 4 Massachusetts MA 52.8 1944 Northeast
## 5 Massachusetts MA 54.7 1948 Northeast
## 6 Massachusetts MA 45.5 1952 Northeast
# Practice by creating an object for a different state or time period.
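As one possible take on the practice prompt, the sketch below filters on a state and a time period at the same time. The `toy` data frame and its numbers are made up for illustration so the block runs without downloading anything; with the real data you would start from `elections` instead.

```r
library(dplyr)

# Made-up numbers in the same shape as `elections` (illustration only)
toy <- data.frame(
  state    = c("New York", "New York", "New York", "Texas"),
  year     = c(1976, 1980, 1984, 1980),
  democrat = c(52, 44, 46, 41)
)

# Conditions separated by commas inside filter() must all be TRUE
ny_recent <- toy |>
  filter(state == "New York", year >= 1980)

ny_recent
```

Only the rows that satisfy both conditions are kept, so `ny_recent` holds the New York rows from 1980 onward.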
As you might expect, there are many functions for plotting! As you
learned in the RStudio tutorials, the starting point for every plot we
make in this course is called ggplot().
Starting with a dataset, you can create a plot with year
on the x-axis and democrat on the y-axis with:
ma |>
ggplot(
aes(
x = year,
y = democrat
)
)
ggplot() makes a plot for you, and the
aes() function (short for “aesthetic”) describes the
variables in the dataset that you want on the x and y axes (for now! We
can use aes() for other things too later).
But it’s empty! To get shapes to appear on the plot, we need to ask
for a particular geom (short for “geometry”). A
geom in R is a way to visualize the data, like a point, a
line, or a shape. To further customize this plot, we simply add a
geom for the shape we want. Let’s use geom_line()
to make a line:
Hint: if the plot below looks too small on your computer, you can click the “show in new window” icon at the top right corner of the plot.
ma |>
ggplot(
aes(
x = year,
y = democrat
)
) +
geom_line()
# Try creating a line plot for a different state
Notice the + sign! We add a + sign between
different pieces of a plot.
We could keep almost exactly the same code and swap in a different geometry, such as points:
ma |>
ggplot(
aes(
x = year,
y = democrat
)
) +
geom_point()
You can also add both! Notice how the points appear on top of the line, since we added them after:
ma |>
ggplot(
aes(
x = year,
y = democrat
)
) +
geom_line() +
geom_point()
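To see why the order of the geoms matters, here is a small sketch that inspects the layers stored inside a plot object. The `toy` data frame is made up for illustration, and peeking at `p$layers` is just one convenient way to look inside a plot, not something the course requires.

```r
library(ggplot2)

# Made-up vote shares (illustration only)
toy <- data.frame(year = c(2000, 2004, 2008), democrat = c(48, 52, 55))

p <- ggplot(toy, aes(x = year, y = democrat)) +
  geom_line() +
  geom_point()

# Layers are stored, and later drawn, in the order they were added
layer_geoms <- sapply(p$layers, function(layer) class(layer$geom)[1])
layer_geoms
```

The line layer comes first and the point layer second, which is why the points are drawn on top. Swapping the two geoms in the code would swap the drawing order.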
# Write your code here
geom_col() instead of points or lines.
# Write your code here
Answer here in text!
We added an x and y aesthetic to choose
particular columns to display on our axes, but plots can accept many
other arguments.
As you saw in the RStudio tutorials, if you want to make your geoms a
certain color, that is very easy to do with the color
argument:
ma |>
ggplot(
aes(
x = year,
y = democrat
)
) +
geom_line(color = "grey") +
geom_point(color = "blue")
This looks great, but what if we want the colors in our plots to depend on the value of the data? For example, red points for elections that Republicans won and blue for elections that Democrats won.
Then, people looking at our plot would see additional pieces of information beyond the values on the x and y axes.
Just like the x and y axes, if we want the color of the points to depend on values in the data we have to use a column in our dataset to define the colors. Let’s make a new column that shows whether the Democratic candidate won the election.
For a crude measure of the election winner, let’s use whether
democrat is greater than 50 percent (this is too simple
since more than two candidates can run, but it’s okay for now).
# Create a new column for a Democratic winner
ma <- ma |>
mutate(
winner = democrat > 50
)
head(ma)
## # A tibble: 6 × 6
## state abb democrat year region winner
## <chr> <chr> <dbl> <dbl> <chr> <lgl>
## 1 Massachusetts MA 50.6 1932 Northeast TRUE
## 2 Massachusetts MA 51.2 1936 Northeast TRUE
## 3 Massachusetts MA 53.1 1940 Northeast TRUE
## 4 Massachusetts MA 52.8 1944 Northeast TRUE
## 5 Massachusetts MA 54.7 1948 Northeast TRUE
## 6 Massachusetts MA 45.5 1952 Northeast FALSE
Remember how this code works: the column democrat in
ma is really a vector. The code works very similarly to
running something like:
democrat <- c(52, 37, 63)
democrat > 50
## [1] TRUE FALSE TRUE
If you want the color of the points to depend on the value of a
column, then you can use the color argument in the
aes() function. R will assign one color to each value in
the winner vector. Since there are only TRUE
and FALSE values in this column, all of the
TRUE values will have one color and FALSE will
have another.
ma |>
ggplot(
aes(
x = year,
y = democrat,
color = winner
)
) +
geom_point()
What if we add the line back?
ma |>
ggplot(
aes(
x = year,
y = democrat,
color = winner)
) +
geom_point() +
geom_line()
Uh-oh! What’s happening here? Well, we’ve asked the plot to change
the color of our shapes according to the
winner variable. Since we have both points and a line, the
plot is trying to change the color of both.
What if we only want to change the color of the points depending on
the value of winner? Well, we can include that aesthetic
only in the geom_point() function.
ma |>
ggplot(
aes(
x = year,
y = democrat
)
) +
geom_line() +
geom_point(aes(color = winner))
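The difference between the two versions can also be checked in code. The sketch below uses a made-up `toy` data frame and the ggplot2 helper `layer_data()` to count how many separate lines each plot would draw. Aesthetics set in `ggplot()` are passed down to every geom, while aesthetics set inside one geom apply only to that geom.

```r
library(ggplot2)

# Made-up vote shares (illustration only)
toy <- data.frame(
  year     = c(2000, 2004, 2008, 2012),
  democrat = c(48, 52, 45, 55)
)
toy$winner <- toy$democrat > 50

# color = winner set in ggplot(): inherited by BOTH geoms
p_global <- ggplot(toy, aes(x = year, y = democrat, color = winner)) +
  geom_point() +
  geom_line()

# color = winner set only in geom_point(): the line ignores it
p_local <- ggplot(toy, aes(x = year, y = democrat)) +
  geom_line() +
  geom_point(aes(color = winner))

# Number of separate lines ggplot2 will draw in each version
# (the line is layer 2 in p_global and layer 1 in p_local)
n_lines_global <- length(unique(layer_data(p_global, 2)$group))
n_lines_local  <- length(unique(layer_data(p_local, 1)$group))
c(global = n_lines_global, local = n_lines_local)
```

In the first version the line is split into one piece per value of `winner`; in the second it stays a single line.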
Like before, you can still set the color of the line manually since
you don’t want the color to vary by the value of a column. Make sure to
do this outside of aes():
ma |>
ggplot(
aes(
x = year,
y = democrat
)
) +
geom_line(color = "grey") +
geom_point(aes(color = winner))
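A short sketch of what goes wrong when a fixed color is placed inside `aes()` by mistake. The `toy` data frame is made up for illustration, and `layer_data()` is used here only to read back the colors ggplot2 ends up with.

```r
library(ggplot2)

# Made-up vote shares (illustration only)
toy <- data.frame(year = c(2000, 2004, 2008), democrat = c(48, 52, 55))

# Outside aes(): "grey" is read as a literal color
p_outside <- ggplot(toy, aes(x = year, y = democrat)) +
  geom_line(color = "grey")

# Inside aes(): "grey" is read as a data value (a one-category variable),
# so ggplot2 picks a color from its default palette and adds a legend
p_inside <- ggplot(toy, aes(x = year, y = democrat)) +
  geom_line(aes(color = "grey"))

colour_outside <- unique(layer_data(p_outside, 1)$colour)
colour_inside  <- unique(layer_data(p_inside, 1)$colour)
colour_outside
colour_inside
```

Only the first version actually draws a grey line. The rule of thumb: column names go inside `aes()`, fixed values go outside.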
Similarly, you can have the size of a point depend
on the value of a column. For example, see how values with a
winner value of TRUE are larger below than
values with FALSE:
ma |>
ggplot(aes(x = year, y = democrat)) +
geom_line(color = "grey") +
geom_point(aes(size = winner))
## Warning: Using size for a discrete variable is not advised.
Now, points are larger when winner is TRUE!
However, those elections already sit higher up on the y-axis (above 50),
so this does not add much information to our plot. R also warns us that
size is not well suited to a discrete variable like winner.
The same is true for shape:
ma |>
ggplot(
aes(
x = year,
y = democrat
)
) +
geom_line(color = "grey") +
geom_point(aes(shape = winner))
Create a new column in the ma dataset called
percent. The values should be equal to the values in
democrat divided by 100.
# Write your code here.
Create a plot using the ma object with
year on the x-axis and percent on the
y-axis.
# Write your code here.
Create a new column in ma called modern
which is TRUE for all elections after 1980 and
FALSE for those before. Create a plot with
year on the x-axis, democrat on the y-axis,
color the points by winner, and vary the shape by
modern.
# Write your code here.
Geometries and aesthetics are the core of a nice visualization. R gives you many more tools to customize your plots any way you want. For example:
Labels are important in any plot. We create these
with the labs() function, which has arguments for
title, subtitle, caption,
x, and y labels. You can choose which labels
to include in your plot. For example:
ma |>
ggplot(
aes(
x = year,
y = democrat
)
) +
geom_line(color = "grey") +
geom_point(aes(color = winner)) +
labs(
title = "Massachusetts Presidential \n Election Results",
subtitle = "1932-2016",
x = "Election Year",
y = "Democratic %"
)
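The `caption` argument is not used in the example above, so here is a minimal sketch of it. The sketch also shows that a later `labs()` call changes only the labels it names. The `toy` data frame and the label text are made up for illustration.

```r
library(ggplot2)

# Made-up vote shares (illustration only)
toy <- data.frame(year = c(2000, 2004, 2008), democrat = c(48, 52, 55))

p <- ggplot(toy, aes(x = year, y = democrat)) +
  geom_point() +
  labs(
    title   = "Toy Election Results",
    caption = "Source: made-up numbers for illustration"
  )

# A later labs() call replaces the title but leaves the caption alone
p2 <- p + labs(title = "A New Title")
p2$labels$title
p2$labels$caption
```

Captions appear at the bottom of the plot and are a natural place to credit a data source.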
You can also set your own axis limits in R: the minimum and maximum values on the x (horizontal) axis and y (vertical) axis. R will often try to pick them for you automatically, but sometimes you may want to choose your own.
The xlim() and ylim() functions will take a
vector (specified by c()) with the smallest and largest
values you want for that axis.
For example, R automatically chose a y-axis for the previous plot
that stretched from around 40 to 70 because that’s where our values
were. However, what if we wanted to make that go from 0 to 100? We could
change ylim() like this:
ma |>
ggplot(
aes(
x = year,
y = democrat
)
) +
geom_line(color = "grey") +
geom_point(aes(color = winner)) +
labs(title = "Massachusetts Presidential \n Election Results",
subtitle = "1932-2016",
x = "Election Year",
y = "Democratic %") +
ylim(c(0, 100)) # now 0 is the minimum, 100 is the maximum
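One detail worth knowing when choosing limits: `ylim()` and `xlim()` discard any data that falls outside the range you give them. In the 0 to 100 example above nothing is lost, because every value fits. The sketch below, with a made-up `toy` data frame, compares `ylim()` with `coord_cartesian()`, a ggplot2 function that zooms the view without discarding data. It is not covered in this tutorial and is shown only for comparison.

```r
library(ggplot2)

# Made-up vote shares (illustration only)
toy <- data.frame(
  year     = c(2000, 2004, 2008, 2012),
  democrat = c(48, 52, 45, 55)
)

base <- ggplot(toy, aes(x = year, y = democrat)) +
  geom_point()

# ylim() turns values outside the limits into missing values
p_ylim <- base + ylim(c(0, 50))

# coord_cartesian() only zooms the view; all the data is kept
p_zoom <- base + coord_cartesian(ylim = c(0, 50))

# How many points still have a usable y value in each version?
kept_ylim <- sum(!is.na(layer_data(p_ylim, 1)$y))
kept_zoom <- sum(!is.na(layer_data(p_zoom, 1)$y))
c(ylim = kept_ylim, coord_cartesian = kept_zoom)
```

With `ylim(c(0, 50))`, the two elections above 50 are dropped, and R prints a warning about removed rows when the plot is drawn.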
Themes are simple ways to improve the presentation
of your plot as well. We will learn how to make our own later, but for
now you can use built-in themes. Some built-in themes include
theme_bw(), theme_minimal(), and
theme_dark().
For convenience, you can also store plots to an object and add additional features onto that object:
# save plot in an object called p
p <- ma |>
ggplot(
aes(
x = year,
y = democrat
)
) +
geom_line(color = "grey") +
geom_point(aes(color = winner)) +
labs(
title = "Massachusetts Presidential Election Results",
subtitle = "1932-2016",
x = "Election Year",
y = "Democratic %"
)
# now we can make more customizations to p
# without retyping everything
p + theme_minimal()
p + theme_dark()
There are many more themes available via packages like
ggthemes.
library(ggthemes)
This opens up many more themes for you, many of which are listed at this link. Here are a few:
p + theme_clean()
p + theme_fivethirtyeight() # 538
p + theme_igray() # Gray background
p + theme_economist() # The Economist
p + theme_stata() # theme from a language called Stata
p + theme_solarized()
You can edit almost anything you want about a plot’s
theme, even if you’ve already added a preset theme.
Most of this work happens through the theme() function.
You can run ?theme to get a full list of options. For
example:
p +
theme_bw() +
theme(legend.position = "bottom")
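The order matters here: a preset theme such as `theme_bw()` replaces every theme setting, so `theme()` tweaks should come after it. A small sketch with a made-up `toy` data frame, reading the setting back from the plot object to check which one survived:

```r
library(ggplot2)

# Made-up vote shares (illustration only)
toy <- data.frame(year = c(2000, 2004, 2008), democrat = c(48, 52, 55))

p <- ggplot(toy, aes(x = year, y = democrat, color = democrat > 50)) +
  geom_point()

# Preset theme first, then the tweak: the tweak survives
p_tweak_last <- p + theme_bw() + theme(legend.position = "bottom")

# Tweak first, then the preset theme: the preset replaces the tweak
p_preset_last <- p + theme(legend.position = "bottom") + theme_bw()

p_tweak_last$theme$legend.position
p_preset_last$theme$legend.position
```

Only the first version keeps the legend at the bottom.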
Often, you will want to plot several groups at once. However, putting all information on one plot can be overwhelming. For example, consider this plot:
northeast <- elections |>
filter(region == "Northeast")
northeast |>
mutate(winner = democrat > 50) |>
ggplot(
aes(
x = year,
y = democrat,
color = winner
)
) +
geom_point()
Why is this so cluttered? Well, we are now plotting results from all
states in the Northeast! We could color by state instead,
but that might look overwhelming:
northeast |>
ggplot(
aes(
x = year,
y = democrat,
color = state
)
) +
geom_point()
Wow! That looks terrible. Instead, what if we plotted a separate line for each state?
northeast |>
ggplot(
aes(
x = year,
y = democrat,
color = state
)
) +
geom_point() +
geom_line()
That looks a little better, but it is still difficult to tell the
lines apart. What if we made a smaller plot for each
state and combined them? This is what a facet is. If we
ask for a facet_wrap() by state, R will make one plot per
state:
northeast |>
ggplot(aes(x = year, y = democrat)) +
geom_point() +
geom_line() +
facet_wrap(~state) + # notice the ~ key (called a tilde)
theme_linedraw()
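As a quick check on how facets work, the sketch below counts the panels that `facet_wrap()` creates and uses its `ncol` argument to control the layout. The `toy` data frame and its three states' numbers are made up for illustration.

```r
library(ggplot2)

# Made-up vote shares for three states (illustration only)
toy <- data.frame(
  state    = rep(c("Maine", "Vermont", "New York"), each = 3),
  year     = rep(c(2000, 2004, 2008), times = 3),
  democrat = c(49, 54, 58, 51, 59, 67, 60, 58, 63)
)

p <- ggplot(toy, aes(x = year, y = democrat)) +
  geom_point() +
  geom_line() +
  facet_wrap(~state, ncol = 1)  # ncol = 1 stacks the panels in one column

# One panel per distinct value of the faceting variable
n_panels <- length(unique(layer_data(p, 1)$PANEL))
n_panels
```

By default all panels share the same x and y axes, which makes it easy to compare states at a glance.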
We could also add the winner color back and
facet_wrap() will automatically apply it to each plot:
northeast |>
mutate(winner = democrat > 50) |>
ggplot(
aes(
x = year,
y = democrat
)
) +
geom_line() +
geom_point(aes(color = winner)) +
facet_wrap(~state) + # notice the ~ key (called a tilde)
theme_linedraw() +
labs(
x = "Election Year",
y = "Democratic %",
title = "Presidential Elections",
subtitle = "1932-2016, Northeastern States"
)
Does the font size look too small to you? There are many ways of
customizing ggplot() objects, many of which we will learn
throughout the course. Here
is a helpful cheatsheet with many of the options listed in case you
would like to delve deeper into this.
# For example, theme() arguments like axis.text and axis.title change font sizes
# for particular places on your plot.
northeast |>
ggplot(
aes(
x = year,
y = democrat
)
) +
geom_point() +
geom_line() +
facet_wrap(~state) + # notice the ~ key (called a tilde)
theme_linedraw() +
theme(
strip.text = element_text(size = 25),
axis.text = element_text(size = 15),
axis.title = element_text(size = 15)
)
The pop dataset below contains state population data
over time. For any state you want, make a plot showing population by
year for every year after 1960.
pop <- read_csv("https://www.dropbox.com/s/javbnd4c3n67380/state_population.csv?dl=1")
## Rows: 6020 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): state, region
## dbl (2): year, population
## lgl (1): after2000
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Write your code here.
# Write your code here.
Please follow the submission instructions listed here. We suggest you submit your assignments as you finish them (i.e., don’t wait until you have completed them all to submit).